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Monte Carlo simulations of a simple lattice model of protein folding show two distinct regimes 
depending on the chain length. The first regime well describes the folding of small protein sequences 
and its kinetic counterpart appears to be single exponential in nature, while the second regime is 
typical of sequences longer than 80 amino acids and the folding performance achievable is sensitive 
to target conformation. The extent to which stability, as measured by the energy of a sequence in 
the target, is an essential requirement and affects the folding dynamics of protein molecules in the 
first regime is investigated. The folding dynamics of sequences whose design stage was restricted 
to a certain fraction of randomly selected amino acids shows that while some degree of stability 
is a necessary and sufficient condition for successful folding, designing sequences that provide the 
lowest energy in the target seems to be a superfluous constraint. By studying the dynamics of 
under annealed but otherwise freely designed sequences we explore the relation between stability 
and kinetic accessibility. We find that there is no one-to-one correspondence between having low 
energy and folding quickly to the target, as only a small fraction of the most stable sequences were 
also found to fold relatively quickly. 
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I. INTRODUCTION 

In the early sixties, the Nobel Laureate Christian An- 
finsen showed through in vitro experiments that denatu- 
rated proteins can refold to their original native structure 
in the absence of any catalyst jlj]. This suggested that 
protein folding (PF) can be a spontaneous, first-order 
process, and that the only information required for the 
protein sequence to fold correctly is the sequence itself. 
Thus, in Anfinsen's perspective, sequence is the only de- 
terminant of the rates and mechanisms of folding. Soon 
after these discoveries, Levinthal pointed out that a ran- 
dom search of the conformational space, as implied by 
Anfinsen's thermodynamic hypothesis, could not explain 
the time scale of folding as observed in Nature 0]. To 
bypass this paradox, Levinthal proposed a kinetic view, 
that proteins must fold through some directed process, 
whose nature could involve for instance, the existence of 
folding pathways. The latter do not necessarily imply a 
fixed sequence of events in folding, nor do they require 
the existence of observable folding intermediates. 

In the late 1980's, Brygenlson and Wolynes ^ intro- 
duced the concept of energy landscape — the free energy 
as a function of protein conformation — as an attempt 
to reconcil the kinetic and thermodynamic views of PF. 
The 'topology' of the landscape characterizes the folding 
kinetics through the existence of folding pathways. In 
Fig. |l| we show a cross section of the energy landscape of 
a hypothetical random heteropolymer: as a consequence 
of the existence of numerous local energy minima these 
sequences tend to behave as highly frustrated systems. 
Nevertheless, it has been shown that a very small frac- 



tion of random sequences are able to stably fold to what 
can be considered their native states (a deep energy min- 
imum of the energy landscape), in a biologically accept- 
able timescale Q . 

Dill and co-workers jq] proposed that a stable, fast fold- 
ing protein sequence must satisfy two essential require- 
ments: thermodynamic stability meaning the existence 
of a deep global minimum in the energy landscape and 
kinetic accessibility meaning the existence of a basin of 
attraction sloping toward that minimum. This basin of 
attraction, first proposed by Leopold et al ^ has become 
one of the most important concepts in protein folding dy- 
namics and is commonly known as the folding funnel. 

In 1993, Shakhnovich and Gutin Q developed a de- 
sign method with the purpose of creating protein like 
sequences, that is, sequences that fold fast and stably to 
their respective native structures. The method is based 
on the thermodynamic stability requirement, and was in- 
spired by the behavior displayed by that very small frac- 
tion of protein like random sequences. Shakhnovich jjj 
claims that a thermodynamic driven sequence selection 
actually solves the kinetic problem as well, which in turn 
suggests a correlation between thermodynamic stability 
and kinetic accessibility. This design method has been 
widely used to study the folding dynamics of protein se- 
quences whose length ranges from a few to at least 80 
beads in simple lattice models. 

Is it possible to fold longer protein chains (> 80 amino 
acids) using the thermodynamic stability as the unique 
driver of sequence design? This open question was the 
starting point and initial motivation for the present pa- 
per. As in previous studies, Monte Carlo simulations of 
a simple lattice model were used to tackle the problem. 
In the present case we particularly focus on the effects of 
native state structure on the folding dynamics. 

This paper is organized as follows. Sec. II reviews the 
model as well as the numerical methods used to sample 
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FIG. 1. Energy landscape of a random heteropolymer (up- 
per figure) and a protein landscape exhibiting a folding funnel. 
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FIG. 2. Move set used to generate the dynamics. 




both conformational and sequence space. In Sec. Ill the 
numerical results are presented. We start by studying 
the dependence of the folding time on temperature and 
then explore the extent to which the former is a sequence 
specific kinetic property. Then, we present evidence for 
the existence of two distinct dynamical regimes in PF. Fi- 
nally, for sequences which to fold in accordance to what 
we call the first regime, we explore the effects of stabil- 
ity on folding dynamics and also how it correlates with 
kinetic accessibility. In Sec. IV we make some final com- 
ments and conclusions. 



II. MODELS AND METHODS 

A. Folding in conformational space 

Part of the contribution of computational physics to 
protein folding includes several findings obtained in the 
scope of simple lattice models. Most generally, these 
models consider a coarse-grained description of the pro- 
tein, reduced to its backbone structure through a 'bead 
& stick' representation. Each bead is ascribed a certain 
chemical identity representing an amino acid type, while 
each stick stands for the peptide bond that covalently 
connects amino acids along the polypeptide chain. This 
structure is allowed to move in a three dimensional infi- 
nite lattice, subjected to excluded volume constraints and 
exploring the conformational space in accordance with a 
kink-jump dynamics obeying the restriction that no bond 
length changes (Fig. |^). The energy of a conformation 
is given by the contact Hamiltonian 



FIG. 3. Example of a 125 bead long target found by ho- 
mopolymer relaxation. 
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where {(Ji} represents an amino acid sequence, <jj stand- 
ing for the chemical identity of bead i, while {r{\ is the 
set of bead coordinates that define the conformation in 
question. The contact function A equals 1 if beads i 
and j are in contact but not covalently linked and is 
otherwise. We follow many previous studies in taking 
the interaction parameters e from the 20 x 20 Myazawa- 
Jerningan matrix, derived from the distribution of con- 
tacts in native proteins pi] ]. Eq. (jj) defines an energy 
function in conformational space whose graph is the en- 
ergy landscape. The energy landscape is explored via 
the Metropolis Monte Carlo (MC) acceptance rule. This 
means that downhill transitions (that lower the energy) 
are accepted with probability unity and uphill transitions 
with probability proportional to the Boltzmann factor. 
Thus if and Tb are two distinct conformations, and 
AH = Ha — Hb is the energy difference between the 
states defined by these conformations, 



1 if AH < 

exp[-AH/k B T] if AH > 0, 



(2) 



where P is the transition probability, T the temperature 
and ks the Boltzmann constant. 
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B. Design in sequence space 

The first step at the design stage is the choice of the 
target structure. A target is a maximally compact and 
otherwise arbitrary structure such as the one shown in 
Fig. ^[ Maximally compact structures maximize the 
number of contacts for a certain chain length and there- 
fore minimize the degeneracy of the Hamiltonian (|l]). 
The goal of the design process is, given a target, to 
find a sequence that folds to it efficiently, and we follow 
Shakhnovich and Gutin jjj in attempting this by seek- 
ing the sequence with the lowest possible energy in the 
target state, as given by Eq. (|l|). For this purpose, tar- 
get's co-ordinates are quenched and the energy of Eq. (|l|) 
must be annealed with respect to the sequence variables. 
This naturally leads to the idea of simulated annealing in 
sequence space; starting from a random initial sequence, 
transitions between different sequences arc successively 
attempted — different sequences are generated by ran- 
domly permutating pairs of beads — along with a suitable 
annealing schedule. If Sa and Sb are two different se- 
quences the transition probability, Ps A _»s B , comes given 
Eq. (^) with the temperature replaced by Tj = aTi^i and 
< a < 1. This optimization procedure has been coined 
simulated annealing in sequence space. It was the first 
successful procedure to design protein sequences. 



III. IN VIRTUO RESULTS 

A. Finding the optimal folding temperature 

For each number of monomers N — 27,36,48,64,80 
and 100 we found 5 different maximally compact target 
structures by homopolymer relaxation. This method is 
an efficient way to systematically find kinetically acces- 
sible maximally compact structures and was previously 
used by Abkevich et al in j|] for 36 bead long targets. 
Then, for each target we prepared a set composed of 30 
sequences by applying the Shakhnovich method as pre- 
viously outlined. Next, we went on to determine the 
optimal folding temperature, Tf Q id- For this purpose, a 
designed sequence at each N was randomly selected and 
subjected to MC folding simulations at several temper- 
atures. For N = 27, 36,48 and 64 a set of 50 MC runs 
was performed for each temperature and the folding time 
t was taken as the value of the mean first passage time 
(FPT) to the target averaged over the 50 MC runs. Re- 
sults plotted in Fig. ^(a) show that there is an optimal 
folding temperature, Tf id, where folding to the target 
structure proceeds relatively fast. However, away from 
this optimum, both at higher and lower temperatures 
the process gets increasingly slower. In the first case 
the protein sequence tends to behave like a random het- 
eropolymer rapidly fluctuating between unfolded states. 
In the second case the folding kinetics gets slower because 
there is a high probability for the chain to get trapped 



into metastable states and folding entails overcomming 
the corresponding energy barriers |12|. 

For N — 80 and N — 100 the number of successful 
folding runs per each studied temperature, was only one 
half of the attempted total. For this reason, we choose 
Tf id as the temperature where the highest ratio of fold- 
ing success could be observed. 

It has been claimed that the folding time and tem- 
perature are both sequence specific parameters p3[ . To 
investigate this issue, five 48 bead long sequences were 
randomly selected (one sequence per target) and their 
folding behavior was studied over a temperature range 
as shown in Fig. [|(b). These show that whilst folding 
to the target can be quite target and (or) sequence de- 
pendent, the optimal folding temperature is close to a 
self-averaging quantity in our simulations. It should be 
emphasised that our sequence design preserved overall 
chemical composition, in contrast to earlier work E3] 
which for unrestricted binary alphabet sequences found 
the optimal temperature to be more sequence dependent. 



B. Dependence of the folding probability on folding 
time: evidence for two folding regimes. 

We have explored the time dependence of folding for 
the different chain lengths N each at their respective op- 
timal folding temperature Tf id(N). Specifically we re- 
port in Fig. @ the probability Pf id(t) of the chain having 
visited its target conformation after time t. A first look 
at the graph suggests that for N up to 64 the curves ap- 
pear to be functionally similar. A scaling factor of the 
form t = (N /N) a t translates in the logarithmic plot to 
a shift 



log t = log t + a log 



N 
~N 



(3) 



Taking a = 5 (and N = 48) the shifted curves N = 27, 
36, 48 and 64 superimpose well as shown in Fig. |[ 
For N > 80 this superposition regime breaks down and 
the asymptotic value of Pf id(t) decreases quite consid- 
erably. Fig. shows that the break in the folding be- 
haviour is associated with the onset of target-dependence 
of the folding curves, where we have computed Pfoidif) 
separately for each of the five targets at both N = 64 
and N = 100. For N = 64 all the targets exhibit a 
similar functional dependence of Pfoia on log(t). All ap- 
pear consistent with asymptotic values of Pfoid — ► 1 and 
also the dispersion of the folding time is small. However, 
for N — 100 four out of the five targets have apparent 
asymptotics Pfoid < 1 and there is considerably larger 
dispersion of the folding time. 

One possible explanation for this change in behaviour 
which we have ruled out, is that for N > 80 the optimal 
folding temperature might become a target sensitive pa- 
rameter. To test this hypothesis, we randomly selected 
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FIG. 4. (a) Dependence of the folding time t on the inverse 
temperature 1/T for several values of the chain length. Note 
that as N increases the temperature range where folding can 
be observed narrows. Fig. mb) shows the same dependence 
for six different 48 bead long sequences trainned to the same 
target. 



five 100 bead long sequences (one per each target), and 
ran 20 folding simulations per each value of the temper- 
ature in a certain temperature range. Fig. ^ shows how 
foldicity, defined as the fraction of successful folding runs 
over the total number of attempted runs, changes with 
inverse temperature for each target. Except for target 
5, all the others exhibit a maximum value of foldicity 
at what we considered the optimal folding temperature. 
Target 5 marginally shows a shallow minimum and is in 
any case the fastest and most successful to fold in our 
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FIG. 5. Dependence of the folding probability (P/ id) on 
log(t). For each curve 150 simulations were used distributed 
across the available targets and sequences. Pfoid was calcu- 
lated as the number of folding simulations which ended up to 
time t normalized to the total number of runs. 
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FIG. 6. Shifted curves for P fold vs log(t) for TV = 27, 
N = 36 and N = 64. These curves were obtained by shifting 
the folding times by a factor of 51og(4|). 
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previous results (Fig. 0(b)), so we can safely rule out tar- 
get sensitive optimal folding temperature as the cause of 
target sensitive folding performance. 

A primary conclusion that can be drawn from these re- 
sults is that for N > 80 folding dynamics becomes target 
selective, certain targets being more kinetically accessible 
than others. Having analised only five targets per chain 
length, we are not in a position to characterise quanti- 
tatively the resulting distribution of behaviour. Never- 
theless, the emergence of this new dynamical feature is a 
clear indication that for N > 80 thermodynamic stability 
is not the dominant drive in folding, and in particular it 
does not solve the kinetic accessibility problem as it has 
been previously suggested. 



C. Folding kinetics-dependence of the folding time 
(t) on the chain length (N) 

The folding time t is a kinetic property that measures 
how fast a protein sequence folds into its native state from 
an initial unfolded coil. It is known that the fastest sim- 
ple, single domain protein folds a million time faster than 
the slowest However, and despite this broad kinetic 
spectrum, there seems to be a general consensus among 
biochemists that protein kinetics falls into two main gen- 
eral classes. The analysis of experimental data collected 
in the course of the last ten years has put forward the 
theory that proteins smaller than 100 amino acids are 
committed to follow a two-state (or single exponential) 
kinetics. The transition state is the only kinetically im- 
portant intermediate and conformational searching is the 
only factor limiting folding speed. Bigger proteins, on the 
other hand generally fold in agreement with a mutiexpo- 
nential kinetics. The latter often involves fast collapse 
into kinetic traps and subsequent slower barrier climb- 
ing out of the traps. This generates an overall process 
characterised by the existence and accumulation of more 
than one important kinetic intermediate. Our results are 
in a broad agreement with this scenario, and the plot of 
(1 — Pfoid) in Fig. H supports identifying our regime for 
N < 80 with single exponential kinetics. 

Recall from Fig. ^], that tuning a to 5 in Eq. (0) nicely 
superimposes the curves of Pfoid vs log t for N up to 64. 
This suggests that in our first regime a scaling law of 
the type t ~ N 5 appropriately describes the dependence 
of the folding time, t, with the chain length, N. The 
plot of In t vslnN in Fig. [To] confirms (with a significant 
correlation of 0.99) an exponent of 5.27. 

Note that for N > 80 the mean FPT does not yield a 
correct estimate of the folding time because the value of 
Pfoid does not tend to one in the limit of large t. There- 
fore we can only analyse the dependence of t on N for 
TV < 80. 

A scaling law of the form t ph N 4 was suggested by a 
previous estimate by Gutin et al Jl^] from folding simula- 
tions to targets which are not maximally compact struc- 
tures. The fact that these targets have a higher kinetic 
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FIG. 7. Separate contribution of each one of the 64 
(Fig. |(a)) and 100 (Fig. |^(b)) bead long targets for the de- 
pendence of the folding probability (Pfoid) on \og(t) . Pfoid 
was calculated as the number of folding simulations which 
ended up to time t normalized to the total number of runs. 
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FIG. 8. Dependence of foldicity on the inverse tempera- 
ture, 1/T for each of the 100 bead long targets. Except for 
target 5, all the other targets exhibit a maximum value of 
foldicity at what we considered to be the optimal folding tem- 
perature. 

accessibility than the ones we considered might explain 
the weaker dependence obtained for the folding time on 
the chain length. 

D. Effects of stability on folding dynamics 

In this section we explore the extent to which stability, 
as measured by the sequence energy in the target affects 
the folding performance of proteins in our first regime. 
For this purpose we studied the MFPT to five target 
states of several 48 bead long sequences whose training 
was handicapped by fixing a priori a certain fraction r jj x 
of the beads. We designed three sets of 30 sequences per 
target, each set corresponding to an rf ix of 0.01, 0.17 
and 0.25 respectively. This biases the design procedure 
to sequences higher in energy as Tfi x increases allowing us 
to explore an energy range of AE tzs 5. The MC folding 
simulations were again performed at Tf Q id and proceeded 
up to n s = 9 x 10 8 MC steps or until folding was observed. 
Firstly we analyse the effects of stability on foldicity. In 
Fig. nil the main plot shows how foldicity changes with 
Tfi x , while in the inner plot, the averaged mean sequence 
energy is plotted against the same parameter. It can be 
seen that foldicity is insensitive to raising the energy up 
to an average value of E ss —19 (corresponding to r/j x = 
0.17) but above this threshold it sharply decreases. Wc 
should stress however that E ~ —19 is still well below the 
threshold heteropolymer energy, Eq [[§, for this specific 
chemical composition (E c ) = -7.3849952±1. 7438 where 
the average is taken over the five considered targets. 
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FIG. 9. Evidence for a single exponential kinetic regime for 
N = 48 and N = 64 but not 100. 
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FIG. 10. Dependence of the folding time t on the chain 
length N for proteins that follow a two-state kinetics. For 
N = 27, 36, 48 and 64 we computed the folding time as the 
mean FPT averaged over 150 simulation runs distributed over 
the five considered targets. We stress that the use of different 
targets to compute this plot does not smear the result because 
there is not target selectivity in this folding regime. 
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FIG. 11. Dependence of foldicity on the fraction of fixed 
monomers, rfi X (main plot) for TV = 48. The inner plot shows 
the dependence of the mean averaged sequence energy with 
r fix- 



Fig. [12] shows the folding probability vs time for each 
value of rfi x . It can be seen that while the Tfi x = 0.17 
curve is only shifted from the rf ix — and rf ix =0.1 
curves (which translates in a slower folding dynamics), 
for rfi x > 0.17 the folding regime clearly breaks away. 
Curiously, the curves corresponding to rfi X — and 
ffix = 0.10 nicely superpose. In energetic terms this 
translates into a break in the folding regime for energies 
higher than E « —19. 

Taken together these results suggest that some degree 
of stability is a sufficient condition for folding, controlling 
and efficiently driving the dynamics of small protein se- 
quences. This in turn agrees with the scenario of folding 
being essentially a downhill process to the native state 
(energy minimum). However, designing sequences that 
provide the lowest energy in the target seems to be a 
superfious constraint. 



E. Stability and kinetic accessibility 

The picture we can draw from the results presented so 
far is that for sequences whose length does not exceed 80 
amino acids, thermodinamically oriented design ensures 
successful folding to the targets. As previously stated, it 
has been claimed [[L2| that not only do these sequences 
fold stably (in the sense of allowing a high target aver- 
age time occupancy), but they also fold quicker, which 
suggests a correlation (at least to some extent) between 
stability and kinetic accessibility. 

Fig. [l3| shows an accessibility-stability plot; accessibil- 
ity is measured by the folding time t (averaged mean FPT 
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FIG. 12. Dependence of the folding probability, Pjoid, on 
log(f) for several values of the fraction of fixed monomers, 



over 10 simulation runs) and each point represents a 48 
bead long sequence with energy E that folds to the tar- 
get in time t. Sequences constrained by different values 
of rfi x are distinguished. 

The graph strongly indicates that the quickest folders 
are not necessarily the most stable sequences. In a previ- 
ous report, Fink and Ball fL5|, jl§], @. showed through 
the study of a convenient ensemble of 27 bead long se- 
quences, that in the region of sufficient thermodynamic 
stability, the latter is in conflict with optimal accessibil- 
ity, and that a significant increase in kinetic performance 
will be achieved if a marginal increase in the target's en- 
ergy is allowed. 

In order to investigate how stability and accessibility 
correlate for 48 bead long protein sequences, we prepared 
an ensemble of « 1000 sequences, but instead of freezing 
all of them, some were annealed to some temperature dif- 
ferent from zero. This allows us to scan a representative 
fraction of the accessibility-stability phase space. Results 
are plotted Fig. |l4|, where once again we are taking t as 
the averaged FPT over 10 simulation runs. The graph 
shows that the connection between stability and accessi- 
bility, is not that of a simple correlation. In particular, it 
can be seen that although the quickest folders appear in 
an energy range of high stability (—22 < E < —23), the 
most stable sequences do not show highest accessibility. 



IV. CONCLUSIONS 

The assumption that 'target structure' could be an im- 
portant parameter in the dynamics of protein folding, 
and its subsequent introduction in the folding simula- 
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FIG. 13. Accessibility-stability plot for the 48 bead long 
sequences whose design stage was handicapped by fixing a 
priori a certain number of beads. 




tions of a simple lattice model, made it possible to dis- 
criminate between two distinct regimes in the dynam- 
ics. The first regime well describes the folding of small 
protein molecules, and its kinetics appears to be single 
exponential. In this case, conformational search must 
be the only factor limiting folding speed. Folding time 
scales with chain length as t « N a in this regime. On the 
other hand, the folding of protein sequences bigger than 
80 amino acids appears to be target sensitive with prone 
to a dynamics which we might interpret as the falling in 
kinetic traps strongly delaying folding to the target. 

The extent to which stability, as measured by the se- 
quence energy in the target, controls folding of proteins 
that fall in the first regime was investigated. Results 
agree with the idea that for small protein molecules, sta- 
bility is a necessary and sufficient condition for success- 
ful folding. However, desigining sequences for minimal 
energy in the target conformation appears to be super- 
fluous. 

The controversial claim that the most stable sequences 
are also the quickest folders was investigated. Notwith- 
standing the fact that this is a delicate issue given the 
considerable dimension of sequence space and the difficult 
task of suitably sampling it, our results (taken from 1000 
sequences) strongly indicate that the correlation between 
stability and accessibility is essentially small. 

As a general conclusion we can say that there is much 
more than thermodynamics in protein folding. In partic- 
ular for long protein chains it is evident that target ge- 
ometry matters and thermodynamic factors are not the 
sole determinant of folding performance. 

There is a growing idea among biologists, that struc- 
tural factors are a key point in protein folding dynam- 
ics. In this context and in the scope of lattice models, 
it is urgent to test the correlation that biochemists find 
between CO (contact order-the average sequence separa- 
tion among contacting residue pairs) and folding rates. 
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